Abstract —Objective methods for assessing perceptual image
quality traditionally attempted to quantify the visibility of errors 
(differences) between a distorted image and a reference image 
using a variety of known properties of the human visual system. 
Under the assumption that human visual perception is highly 
adapted for extracting structural information from a scene, we 
introduce an alternative complementary framework for quality 
assessment based on the degradation of structural information. 
As a specific example of this concept, we develop a Structural 
Similarity Index and demonstrate its promise through a set of 
intuitive examples, as well as comparison to both subjective 
ratings and state-of-the-art objective methods on a database of 
images compressed with JPEG and JPEG2000. 1 

Index Terms —Error sensitivity, human visual system (HVS), 
image coding, image quality assessment, JPEG, JPEG2000, 
perceptual quality, structural information, structural similarity 
(SSIM). 


I. Introduction 

Digital images are subject to a wide variety of distortions
during acquisition, processing, compression, storage, 
transmission and reproduction, any of which may result in a 
degradation of visual quality. For applications in which images 
are ultimately to be viewed by human beings, the only “correct” 
method of quantifying visual image quality is through subjective
evaluation. In practice, however, subjective evaluation is
usually too inconvenient, time-consuming and expensive. The 
goal of research in objective image quality assessment is to 
develop quantitative measures that can automatically predict 
perceived image quality. 

An objective image quality metric can play a variety of roles
in image processing applications. First, it can be used to dynamically
monitor and adjust image quality. For example, a network
digital video server can examine the quality of video being
transmitted in order to control and allocate streaming resources.
Second, it can be used to optimize algorithms and parameter
settings of image processing systems. For instance, in a visual
communication system, a quality metric can assist in the optimal
design of prefiltering and bit assignment algorithms at the
encoder and of optimal reconstruction, error concealment, and
postfiltering algorithms at the decoder. Third, it can be used to
benchmark image processing systems and algorithms.

Manuscript received January 15, 2003; revised August 18, 2003. The work
of Z. Wang and E. P. Simoncelli was supported by the Howard Hughes Medical
Institute. The work of A. C. Bovik and H. R. Sheikh was supported by the
National Science Foundation and the Texas Advanced Research Program. The
associate editor coordinating the review of this manuscript and approving it for
publication was Dr. Reiner Eschbach.

Z. Wang and E. P. Simoncelli are with the Howard Hughes Medical
Institute, the Center for Neural Science and the Courant Institute for Mathematical
Sciences, New York University, New York, NY 10012 USA (e-mail:
zhouwang@ieee.org; eero.simoncelli@nyu.edu).

A. C. Bovik and H. R. Sheikh are with the Laboratory for Image and
Video Engineering (LIVE), Department of Electrical and Computer Engineering,
The University of Texas at Austin, Austin, TX 78712 USA (e-mail:
bovik@ece.utexas.edu; hamid.sheikh@ieee.org).

Digital Object Identifier 10.1109/TIP.2003.819861

1 A MATLAB implementation of the proposed algorithm is available online at
http://www.cns.nyu.edu/~lcv/ssim/.

Objective image quality metrics can be classified according 
to the availability of an original (distortion-free) image, with 
which the distorted image is to be compared. Most existing 
approaches are known as full-reference, meaning that a complete
reference image is assumed to be known. In many practical
applications, however, the reference image is not available, and
a no-reference or "blind" quality assessment approach is desirable.
In a third type of method, the reference image is only partially
available, in the form of a set of extracted features made
available as side information to help evaluate the quality of the 
distorted image. This is referred to as reduced-reference quality 
assessment. This paper focuses on full-reference image quality 
assessment. 

The simplest and most widely used full-reference quality 
metric is the mean squared error (MSE), computed by averaging
the squared intensity differences of distorted and
reference image pixels, along with the related quantity of peak 
signal-to-noise ratio (PSNR). These are appealing because they 
are simple to calculate, have clear physical meanings, and are 
mathematically convenient in the context of optimization. But 
they are not very well matched to perceived visual quality (e.g., 
[1]-[9]). In the last three decades, a great deal of effort has
gone into the development of quality assessment methods that 
take advantage of known characteristics of the human visual 
system (HVS). The majority of the proposed perceptual quality 
assessment models have followed a strategy of modifying the 
MSE measure so that errors are penalized in accordance with 
their visibility. Section II summarizes this type of error-sensitivity
approach and discusses its difficulties and limitations. In
Section III, we describe a new paradigm for quality assessment, 
based on the hypothesis that the HVS is highly adapted for 
extracting structural information. As a specific example, we develop
a measure of structural similarity (SSIM) that compares
local patterns of pixel intensities that have been normalized 
for luminance and contrast. In Section IV, we compare the test 
results of different quality assessment models against a large 
set of subjective ratings gathered for a database of 344 images 
compressed with JPEG and JPEG2000. 
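
The MSE and PSNR described at the start of the preceding paragraph can be sketched in a few lines. This is an illustrative sketch only: the PSNR expression 10 log10(L^2 / MSE) is the standard definition rather than a formula stated in this text, and the 2 x 2 pixel values are hypothetical. The example shows two different error patterns that receive exactly the same MSE.

```python
import math

def mse(reference, distorted):
    """Mean squared error between two equally sized 2-D lists of pixel values."""
    squared = [
        (r - d) ** 2
        for ref_row, dis_row in zip(reference, distorted)
        for r, d in zip(ref_row, dis_row)
    ]
    return sum(squared) / len(squared)

def psnr(reference, distorted, dynamic_range=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    error = mse(reference, distorted)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(dynamic_range ** 2 / error)

# Toy 2 x 2 "images" (hypothetical values chosen for illustration).
reference = [[100, 110], [120, 130]]
mean_shifted = [[110, 120], [130, 140]]  # every pixel raised by 10
sign_mixed = [[110, 100], [110, 140]]    # errors of +10, -10, -10, +10

print(mse(reference, mean_shifted), mse(reference, sign_mixed))
print(round(psnr(reference, mean_shifted), 2))
```

Both distortions have the same squared error at every pixel, so neither MSE nor PSNR can distinguish a uniform brightness shift from an error pattern that changes the local structure.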

Fig. 1. A prototypical quality assessment system based on error sensitivity. Note that the CSF feature can be implemented either as a separate stage (as shown)
or within “Error Normalization.” 


II. Image Quality Assessment Based on 
Error Sensitivity 

An image signal whose quality is being evaluated can be 
thought of as a sum of an undistorted reference signal and an 
error signal. A widely adopted assumption is that the loss of 
perceptual quality is directly related to the visibility of the error 
signal. The simplest implementation of this concept is the MSE, 
which objectively quantifies the strength of the error signal. But 
two distorted images with the same MSE may have very different
types of errors, some of which are much more visible than
others. Most perceptual image quality assessment approaches 
proposed in the literature attempt to weight different aspects of 
the error signal according to their visibility, as determined by 
psychophysical measurements in humans or physiological measurements
in animals. This approach was pioneered by Mannos
and Sakrison [10], and has been extended by many other researchers
over the years. Reviews on image and video quality
assessment algorithms can be found in [4] and [11]-[13].

A. Framework 

Fig. 1 illustrates a generic image quality assessment framework
based on error sensitivity. Most perceptual quality assessment
models can be described with a similar diagram, although
they differ in detail. The stages of the diagram are as follows. 

• Pre-processing: This stage typically performs a variety of 
basic operations to eliminate known distortions from the 
images being compared. First, the distorted and reference 
signals are properly scaled and aligned. Second, the signal 
might be transformed into a color space (e.g., [14]) that 
is more appropriate for the HVS. Third, quality assessment
metrics may need to convert the digital pixel values
stored in the computer memory into luminance values of 
pixels on the display device through pointwise nonlinear 
transformations. Fourth, a low-pass filter simulating the 
point spread function of the eye optics may be applied. 
Finally, the reference and the distorted images may be 
modified using a nonlinear point operation to simulate 
light adaptation. 

• CSF Filtering: The contrast sensitivity function (CSF) 
describes the sensitivity of the HVS to different spatial 
and temporal frequencies that are present in the visual 
stimulus. Some image quality metrics include a stage that 
weights the signal according to this function (typically 
implemented using a linear filter that approximates the 
frequency response of the CSF). However, many recent 
metrics choose to implement CSF as a base-sensitivity 
normalization factor after channel decomposition. 


• Channel Decomposition: The images are typically separated
into subbands (commonly called "channels" in the
psychophysics literature) that are selective for spatial and 
temporal frequency as well as orientation. While some 
quality assessment methods implement sophisticated 
channel decompositions that are believed to be closely 
related to the neural responses in the primary visual cortex 
[2], [15]-[19], many metrics use simpler transforms such
as the discrete cosine transform (DCT) [20], [21] or
separable wavelet transforms [22]-[24]. Channel decompositions
tuned to various temporal frequencies have also
been reported for video quality assessment [5], [25]. 

• Error Normalization: The error (difference) between 
the decomposed reference and distorted signals in each 
channel is calculated and normalized according to a 
certain masking model, which takes into account the fact 
that the presence of one image component will decrease 
the visibility of another image component that is proximate
in spatial or temporal location, spatial frequency,
or orientation. The normalization mechanism weights the 
error signal in a channel by a space-varying visibility 
threshold [26]. The visibility threshold at each point is 
calculated based on the energy of the reference and/or 
distorted coefficients in a neighborhood (which may 
include coefficients from within a spatial neighborhood 
of the same channel as well as other channels) and the 
base-sensitivity for that channel. The normalization 
process is intended to convert the error into units of just 
noticeable difference (JND). Some methods also consider 
the effect of contrast response saturation (e.g., [2]). 

• Error Pooling: The final stage of all quality metrics must 
combine the normalized error signals over the spatial 
extent of the image, and across the different channels, 
into a single value. For most quality assessment methods, 
pooling takes the form of a Minkowski norm as follows: 

$$E = \left( \sum_{l} \sum_{k} |e_{l,k}|^{\beta} \right)^{1/\beta} \tag{1}$$

where $e_{l,k}$ is the normalized error of the $k$-th coefficient
in the $l$-th channel, and $\beta$ is a constant exponent typically
chosen to lie between 1 and 4. Minkowski pooling may be
performed over space (index $k$) and then over frequency
(index $l$), or vice versa, with some nonlinearity between
them, or possibly with different exponents $\beta$. A spatial
map indicating the relative importance of different regions
may also be used to provide spatially variant weighting
[25], [27], [28].
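
The Minkowski pooling stage can be sketched as follows. This is a minimal illustration, assuming the normalized errors are already available as a list of channels, each holding a list of coefficients; the function name and the toy error values are hypothetical.

```python
def minkowski_pool(errors, beta=2.0):
    """Pool normalized errors e[l][k] (channel l, coefficient k) into one value.

    Computes (sum over l and k of |e[l][k]| ** beta) ** (1 / beta).
    """
    total = sum(abs(e) ** beta for channel in errors for e in channel)
    return total ** (1.0 / beta)

# Two channels with two coefficients each (hypothetical values).
errors = [[3.0, 0.0], [0.0, 4.0]]
rearranged = [[0.0, 4.0], [3.0, 0.0]]  # same error values at different positions

print(minkowski_pool(errors, beta=1.0))  # sum of absolute errors
print(minkowski_pool(errors, beta=2.0))  # Euclidean norm
print(minkowski_pool(errors, beta=4.0))  # larger beta emphasizes the largest error
```

Because the pooled value depends only on the multiset of error magnitudes, rearranging the errors across positions or channels leaves it unchanged.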

B. Limitations

The underlying principle of the error-sensitivity approach 
is that perceptual quality is best estimated by quantifying the 
visibility of errors. This is essentially accomplished by simulating
the functional properties of early stages of the HVS, as
characterized by both psychophysical and physiological experiments.
Although this bottom-up approach to the problem has
found nearly universal acceptance, it is important to recognize 
its limitations. In particular, the HVS is a complex and highly 
nonlinear system, but most models of early vision are based 
on linear or quasilinear operators that have been characterized 
using restricted and simplistic stimuli. Thus, error-sensitivity 
approaches must rely on a number of strong assumptions and 
generalizations. These have been noted by many previous 
authors, and we provide only a brief summary here. 

• The Quality Definition Problem. The most fundamental
problem with the traditional approach is the definition of
image quality. In particular, it is not clear that error visibility
should be equated with loss of quality, as some distortions
may be clearly visible but not so objectionable.
An obvious example would be multiplication of the image 
intensities by a global scale factor. The study in [29] also 
suggested that the correlation between image fidelity and 
image quality is only moderate. 

• The Suprathreshold Problem. The psychophysical experiments
that underlie many error sensitivity models are
specifically designed to estimate the threshold at which a 
stimulus is just barely visible. These measured threshold 
values are then used to define visual error sensitivity 
measures, such as the CSF and various masking effects. 
However, very few psychophysical studies indicate 
whether such near-threshold models can be generalized 
to characterize perceptual distortions significantly larger 
than threshold levels, as is the case in a majority of image 
processing situations. In the suprathreshold range, can 
the relative visual distortions between different channels 
be normalized using the visibility thresholds? Recent 
efforts have been made to incorporate suprathreshold 
psychophysics for analyzing image distortions (e.g., 
[30]-[34]).

• The Natural Image Complexity Problem. Most psychophysical
experiments are conducted using relatively
simple patterns, such as spots, bars, or sinusoidal gratings. 
For example, the CSF is typically obtained from threshold 
experiments using global sinusoidal images. The masking 
phenomena are usually characterized using a superposition
of two (or perhaps a few) different patterns. But all
such patterns are much simpler than real world images, 
which can be thought of as a superposition of a much 
larger number of simple patterns. Can the models for 
the interactions between a few simple patterns generalize
to evaluate interactions between tens or hundreds
of patterns? Is this limited number of simple-stimulus 
experiments sufficient to build a model that can predict 
the visual quality of complex-structured natural images? 
Although the answers to these questions are currently not 
known, the recently established Modelfest dataset [35] 
includes both simple and complex patterns, and should 
facilitate future studies. 

• The Decorrelation Problem. When one chooses to use a
Minkowski metric for spatially pooling errors, one is implicitly
assuming that errors at different locations are statistically
independent. This would be true if the processing
prior to the pooling eliminated dependencies in the input
signals. Empirically, however, this is not the case for linear
channel decomposition methods such as the wavelet transform.
It has been shown that a strong dependency exists
between intra- and inter-channel wavelet coefficients of
natural images [36], [37]. In fact, state-of-the-art wavelet
image compression techniques achieve their success by
exploiting this strong dependency [38]-[41]. Psychophysically,
various visual masking models have been used to
account for the interactions between coefficients [2], [42].
Statistically, it has been shown that a well-designed nonlinear
gain control model, in which parameters are optimized
to reduce dependencies rather than for fitting data
from masking experiments, can greatly reduce the dependencies
of the transform coefficients [43], [44]. In [45],
[46], it is shown that optimal design of transformation and
masking models can reduce both statistical and perceptual
dependencies. It remains to be seen how much these
models can improve the performance of the current quality
assessment algorithms.

• The Cognitive Interaction Problem. It is widely known 
that cognitive understanding and interactive visual processing
(e.g., eye movements) influence the perceived
quality of images. For example, a human observer will 
give different quality scores to the same image if s/he 
is provided with different instructions [4], [30]. Prior 
information regarding the image content, or attention 
and fixation, may also affect the evaluation of the image 
quality [4], [47]. But most image quality metrics do not 
consider these effects, as they are difficult to quantify and 
not well understood. 

III. Structural-Similarity-Based 
Image Quality Assessment 

Natural image signals are highly structured: their pixels 
exhibit strong dependencies, especially when they are spatially 
proximate, and these dependencies carry important information 
about the structure of the objects in the visual scene. The 
Minkowski error metric is based on pointwise signal differences,
which are independent of the underlying signal structure.
Although most quality measures based on error sensitivity 
decompose image signals using linear transformations, these 
do not remove the strong dependencies, as discussed in the 
previous section. The motivation of our new approach is to find 
a more direct way to compare the structures of the reference 
and the distorted signals. 

A. New Philosophy 

In [6] and [9], a new framework for the design of image 
quality measures was proposed, based on the assumption that 
the human visual system is highly adapted to extract structural 
information from the viewing field. It follows that a measure of 
structural information change can provide a good approximation
to perceived image distortion.

Fig. 2. Comparison of “Boat” images with different types of distortions, all with MSE = 210. (a) Original image (8 bits/pixel; cropped from 512 × 512 to 256 × 256 for visibility). (b) Contrast-stretched image, MSSIM = 0.9168. (c) Mean-shifted image, MSSIM = 0.9900. (d) JPEG compressed image, MSSIM = 0.6949. (e) Blurred image, MSSIM = 0.7032. (f) Salt-pepper impulsive noise contaminated image, MSSIM = 0.7748.


This new philosophy can be best understood through comparison
with the error sensitivity philosophy. First, the error
sensitivity approach estimates perceived errors to quantify 
image degradations, while the new philosophy considers image 
degradations as perceived changes in structural information 
variation. A motivating example is shown in Fig. 2, where the 
original “Boat” image is altered with different distortions, each 
adjusted to yield nearly identical MSE relative to the original 
image. Despite this, the images can be seen to have drastically
different perceptual quality. With the error sensitivity
philosophy, it is difficult to explain why the contrast-stretched 
image has very high quality in consideration of the fact that its 
visual difference from the reference image is easily discerned. 
But it is easily understood with the new philosophy since 
nearly all the structural information of the reference image is 
preserved, in the sense that the original information can be 
nearly fully recovered via a simple pointwise inverse linear 
luminance transform (except perhaps for the very bright and 
dark regions where saturation occurs). On the other hand, some 
structural information from the original image is permanently 
lost in the JPEG compressed and the blurred images, and 
therefore they should be given lower quality scores than the 
contrast-stretched and mean-shifted images. 

Second, the error-sensitivity paradigm is a bottom-up 
approach, simulating the function of relevant early-stage components
in the HVS. The new paradigm is a top-down approach,
mimicking the hypothesized functionality of the overall HVS. 
This, on the one hand, avoids the suprathreshold problem 
mentioned in the previous section because it does not rely on
threshold psychophysics to quantify the perceived distortions.
On the other hand, the cognitive interaction problem is also 
reduced to a certain extent because probing the structures of 
the objects being observed is thought of as the purpose of the 
entire process of visual observation, including high level and 
interactive processes. 

Third, the problems of natural image complexity and decorrelation
are also avoided to some extent because the new
philosophy does not attempt to predict image quality by accumulating
the errors associated with psychophysically understood
simple patterns. Instead, the new philosophy proposes to evaluate
the structural changes between two complex-structured
signals directly.

B. The SSIM Index 

We construct a specific example of an SSIM quality measure
from the perspective of image formation. A previous instantiation
of this approach was made in [6]-[8] and promising results
on simple tests were achieved. In this paper, we generalize this
algorithm, and provide a more extensive set of validation results. 

The luminance of the surface of an object being observed is
the product of the illumination and the reflectance, but the structures
of the objects in the scene are independent of the illumination.
Consequently, to explore the structural information in
an image, we wish to separate the influence of the illumination.
We define the structural information in an image as those attributes
that represent the structure of objects in the scene, independent
of the average luminance and contrast. Since luminance
and contrast can vary across a scene, we use the local luminance
and contrast for our definition.

Fig. 3. Diagram of the structural similarity (SSIM) measurement system.

The system diagram of the proposed quality assessment 
system is shown in Fig. 3. Suppose x and y are two nonnegative 
image signals, which have been aligned with each other (e.g., 
spatial patches extracted from each image). If we consider 
one of the signals to have perfect quality, then the similarity 
measure can serve as a quantitative measurement of the quality 
of the second signal. The system separates the task of similarity 
measurement into three comparisons: luminance, contrast and 
structure. First, the luminance of each signal is compared. Assuming
discrete signals, this is estimated as the mean intensity

$$\mu_x = \frac{1}{N} \sum_{i=1}^{N} x_i. \tag{2}$$

The luminance comparison function $l(\mathbf{x}, \mathbf{y})$ is then a function
of $\mu_x$ and $\mu_y$.

Second, we remove the mean intensity from the signal. In
discrete form, the resulting signal $\mathbf{x} - \mu_x$ corresponds to the
projection of vector $\mathbf{x}$ onto the hyperplane defined by

$$\sum_{i=1}^{N} x_i = 0. \tag{3}$$

We use the standard deviation (the square root of variance) as an
estimate of the signal contrast. An unbiased estimate in discrete
form is given by

$$\sigma_x = \left( \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu_x)^2 \right)^{1/2}. \tag{4}$$

The contrast comparison $c(\mathbf{x}, \mathbf{y})$ is then the comparison of $\sigma_x$
and $\sigma_y$.

Third, the signal is normalized (divided) by its own standard
deviation, so that the two signals being compared have unit standard
deviation. The structure comparison $s(\mathbf{x}, \mathbf{y})$ is conducted
on these normalized signals $(\mathbf{x} - \mu_x)/\sigma_x$ and $(\mathbf{y} - \mu_y)/\sigma_y$.
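
The three preparation steps above (mean intensity, unbiased standard deviation, and normalization) can be sketched as follows. This is an illustrative sketch with hypothetical helper names and toy signal values; it treats each signal as a flat list of samples.

```python
import math

def mean(signal):
    """Mean intensity, as in (2)."""
    return sum(signal) / len(signal)

def std(signal):
    """Unbiased standard deviation (divides by N - 1), as in (4)."""
    mu = mean(signal)
    return math.sqrt(sum((v - mu) ** 2 for v in signal) / (len(signal) - 1))

def normalize(signal):
    """Subtract the mean and divide by the standard deviation."""
    mu, sigma = mean(signal), std(signal)
    return [(v - mu) / sigma for v in signal]

x = [2.0, 4.0, 6.0, 8.0]
y = [2.0 * v + 10.0 for v in x]  # same structure, different luminance and contrast

print(mean(x), std(x))
print(normalize(x))
print(normalize(y))
```

The two normalized signals coincide, which illustrates why a change of luminance and contrast alone leaves the structure term unaffected.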


Finally, the three components are combined to yield an
overall similarity measure

$$S(\mathbf{x}, \mathbf{y}) = f\big(l(\mathbf{x}, \mathbf{y}),\, c(\mathbf{x}, \mathbf{y}),\, s(\mathbf{x}, \mathbf{y})\big). \tag{5}$$

An important point is that the three components are relatively
independent. For example, the change of luminance and/or contrast
will not affect the structures of images.

In order to complete the definition of the similarity measure
in (5), we need to define the three functions $l(\mathbf{x}, \mathbf{y})$, $c(\mathbf{x}, \mathbf{y})$, and
$s(\mathbf{x}, \mathbf{y})$, as well as the combination function $f(\cdot)$. We also would
like the similarity measure to satisfy the following conditions.

1) Symmetry: $S(\mathbf{x}, \mathbf{y}) = S(\mathbf{y}, \mathbf{x})$.

2) Boundedness: $S(\mathbf{x}, \mathbf{y}) \le 1$.

3) Unique maximum: $S(\mathbf{x}, \mathbf{y}) = 1$ if and only if $\mathbf{x} = \mathbf{y}$ (in
discrete representations, $x_i = y_i$ for all $i = 1, 2, \ldots, N$).

For luminance comparison, we define

$$l(\mathbf{x}, \mathbf{y}) = \frac{2 \mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} \tag{6}$$

where the constant $C_1$ is included to avoid instability when $\mu_x^2 + \mu_y^2$
is very close to zero. Specifically, we choose

$$C_1 = (K_1 L)^2 \tag{7}$$

where $L$ is the dynamic range of the pixel values (255 for 8-bit
grayscale images), and $K_1 \ll 1$ is a small constant. Similar
considerations also apply to contrast comparison and structure
comparison described later. Equation (6) is easily seen to obey
the three properties listed above.
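
A minimal sketch of the luminance comparison (6) with the constant (7) is given below. The default $K_1 = 0.01$ is the setting the paper reports later in Section III-C; the function name and test values are hypothetical.

```python
def luminance_comparison(mu_x, mu_y, k1=0.01, dynamic_range=255.0):
    """Luminance comparison l(x, y) of (6), with C1 = (K1 * L) ** 2 from (7)."""
    c1 = (k1 * dynamic_range) ** 2
    return (2.0 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)

print(luminance_comparison(100.0, 100.0))  # identical means
print(luminance_comparison(100.0, 120.0))  # different means give a value below 1
print(luminance_comparison(0.0, 0.0))      # C1 keeps the ratio defined at zero
```

The last call shows the role of $C_1$: without it the expression would be 0/0 when both means vanish.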

Equation (6) is also qualitatively consistent with Weber’s law,
which has been widely used to model light adaptation (also
called luminance masking) in the HVS. According to Weber’s
law, the magnitude of a just-noticeable luminance change $\Delta I$ is
approximately proportional to the background luminance $I$ for
a wide range of luminance values. In other words, the HVS is
sensitive to the relative luminance change, and not the absolute
luminance change. Letting $R$ represent the size of luminance
change relative to background luminance, we rewrite the luminance
of the distorted signal as $\mu_y = (1 + R)\mu_x$. Substituting
this into (6) and dividing numerator and denominator by $\mu_x^2$ gives

$$l(\mathbf{x}, \mathbf{y}) = \frac{2(1 + R) + C_1/\mu_x^2}{1 + (1 + R)^2 + C_1/\mu_x^2}. \tag{8}$$

If we assume $C_1$ is small enough (relative to $\mu_x^2$) to be ignored,
then $l(\mathbf{x}, \mathbf{y})$ is a function only of $R$, qualitatively consistent with
Weber’s law.
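
The Weber's-law behavior can be checked numerically. The sketch below, which uses hypothetical function names and luminance values, applies the same relative change $R$ at two very different background luminances and compares the results with the $C_1 \to 0$ limit $2(1+R)/(1+(1+R)^2)$.

```python
def luminance_comparison(mu_x, mu_y, c1):
    """Luminance comparison l(x, y) of (6)."""
    return (2.0 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)

def weber_limit(r):
    """Value of the luminance comparison when C1 is negligible; depends on R only."""
    return 2.0 * (1.0 + r) / (1.0 + (1.0 + r) ** 2)

C1 = (0.01 * 255.0) ** 2  # K1 = 0.01 and L = 255, the settings reported in the paper
R = 0.1                   # a 10% relative luminance change (illustrative value)

dark = luminance_comparison(50.0, (1.0 + R) * 50.0, C1)
bright = luminance_comparison(200.0, (1.0 + R) * 200.0, C1)

print(dark, bright, weber_limit(R))
```

The absolute luminance change is four times larger in the bright case, yet the two scores are nearly identical because the relative change is the same.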

The contrast comparison function takes a similar form

$$c(\mathbf{x}, \mathbf{y}) = \frac{2 \sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} \tag{9}$$

where $C_2 = (K_2 L)^2$, and $K_2 \ll 1$. This definition again satisfies
the three properties listed above. An important feature of
this function is that with the same amount of contrast change
$\Delta\sigma = \sigma_y - \sigma_x$, this measure is less sensitive to the case of high
base contrast $\sigma_x$ than low base contrast. This is consistent with
the contrast-masking feature of the HVS.
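
The contrast-masking property can be illustrated with a short sketch of (9). The default $K_2 = 0.03$ is the setting the paper reports in Section III-C; the function name and the contrast values are hypothetical.

```python
def contrast_comparison(sigma_x, sigma_y, k2=0.03, dynamic_range=255.0):
    """Contrast comparison c(x, y) of (9), with C2 = (K2 * L) ** 2."""
    c2 = (k2 * dynamic_range) ** 2
    return (2.0 * sigma_x * sigma_y + c2) / (sigma_x ** 2 + sigma_y ** 2 + c2)

delta = 5.0  # the same absolute contrast change in both cases
low_base = contrast_comparison(10.0, 10.0 + delta)
high_base = contrast_comparison(50.0, 50.0 + delta)

print(low_base, high_base)
```

The score drops further from 1 at the low base contrast, so the same contrast change is penalized less when the base contrast is high.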

Structure comparison is conducted after luminance subtraction
and variance normalization. Specifically, we associate the
two unit vectors $(\mathbf{x} - \mu_x)/\sigma_x$ and $(\mathbf{y} - \mu_y)/\sigma_y$, each lying in
the hyperplane defined by (3), with the structure of the two images.
The correlation (inner product) between these is a simple
and effective measure to quantify the structural similarity. Notice
that the correlation between $(\mathbf{x} - \mu_x)/\sigma_x$ and $(\mathbf{y} - \mu_y)/\sigma_y$ is
equivalent to the correlation coefficient between $\mathbf{x}$ and $\mathbf{y}$. Thus,
we define the structure comparison function as follows:

$$s(\mathbf{x}, \mathbf{y}) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}. \tag{10}$$

As in the luminance and contrast measures, we have introduced
a small constant in both denominator and numerator. In discrete
form, $\sigma_{xy}$ can be estimated as

$$\sigma_{xy} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y). \tag{11}$$

Geometrically, the correlation coefficient corresponds to the cosine
of the angle between the vectors $\mathbf{x} - \mu_x$ and $\mathbf{y} - \mu_y$. Note
also that $s(\mathbf{x}, \mathbf{y})$ can take on negative values.
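
A minimal sketch of the structure comparison (10), using the covariance estimate (11), is shown below. The constant is set to $C_3 = C_2/2$ with $K_2 = 0.03$, which are the choices the paper states later; the function name and toy signals are hypothetical.

```python
import math

def structure_comparison(x, y, c3):
    """Structure comparison s(x, y) of (10) with unbiased estimates (4) and (11)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    return (cov + c3) / (math.sqrt(var_x) * math.sqrt(var_y) + c3)

C3 = ((0.03 * 255.0) ** 2) / 2.0

x = [10.0, 20.0, 30.0, 40.0]
identical = structure_comparison(x, x, C3)
inverted = structure_comparison(x, x[::-1], C3)  # anti-correlated signal

print(identical, inverted)
```

The inverted signal has the same mean and contrast as the original but the opposite structure, and the score becomes negative.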

Finally, we combine the three comparisons of (6), (9) and
(10) and name the resulting similarity measure the SSIM index
between signals $\mathbf{x}$ and $\mathbf{y}$

$$\mathrm{SSIM}(\mathbf{x}, \mathbf{y}) = [l(\mathbf{x}, \mathbf{y})]^{\alpha} \cdot [c(\mathbf{x}, \mathbf{y})]^{\beta} \cdot [s(\mathbf{x}, \mathbf{y})]^{\gamma} \tag{12}$$

where $\alpha > 0$, $\beta > 0$ and $\gamma > 0$ are parameters used to adjust the
relative importance of the three components. It is easy to verify
that this definition satisfies the three conditions given above. In
order to simplify the expression, we set $\alpha = \beta = \gamma = 1$ and
$C_3 = C_2/2$ in this paper. This results in a specific form of the
SSIM index

$$\mathrm{SSIM}(\mathbf{x}, \mathbf{y}) = \frac{(2 \mu_x \mu_y + C_1)(2 \sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}. \tag{13}$$

The “universal quality index” (UQI) defined in [6] and [7] corresponds
to the special case that $C_1 = C_2 = 0$, which produces
unstable results when either $(\mu_x^2 + \mu_y^2)$ or $(\sigma_x^2 + \sigma_y^2)$ is very close
to zero.
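
The step from (12) to (13) can be checked numerically. The sketch below, with hypothetical function names and toy patch values, computes the product of the three components with $\alpha = \beta = \gamma = 1$ and $C_3 = C_2/2$, and compares it with the simplified form. It uses the unbiased estimates (4) and (11) and the settings $K_1 = 0.01$, $K_2 = 0.03$ reported in the paper.

```python
import math

def _stats(x, y):
    """Means, unbiased variances and covariance of two flat patches."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    return mu_x, mu_y, var_x, var_y, cov

def ssim_product(x, y, k1=0.01, k2=0.03, dynamic_range=255.0):
    """SSIM as the product l * c * s of (12) with unit exponents."""
    c1, c2 = (k1 * dynamic_range) ** 2, (k2 * dynamic_range) ** 2
    c3 = c2 / 2.0
    mu_x, mu_y, var_x, var_y, cov = _stats(x, y)
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    l = (2.0 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    c = (2.0 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    s = (cov + c3) / (sd_x * sd_y + c3)
    return l * c * s

def ssim_closed_form(x, y, k1=0.01, k2=0.03, dynamic_range=255.0):
    """SSIM in the simplified form (13)."""
    c1, c2 = (k1 * dynamic_range) ** 2, (k2 * dynamic_range) ** 2
    mu_x, mu_y, var_x, var_y, cov = _stats(x, y)
    return ((2.0 * mu_x * mu_y + c1) * (2.0 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

x = [52.0, 60.0, 71.0, 80.0]
y = [55.0, 58.0, 75.0, 90.0]
flat = [0.0, 0.0, 0.0, 0.0]

print(ssim_product(x, y), ssim_closed_form(x, y))
print(ssim_closed_form(flat, flat))  # constants keep the all-zero case well defined
```

With $C_1 = C_2 = 0$ (the UQI case) the all-zero patch would lead to a division by zero, which is the instability the constants are meant to avoid.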


The relationship between the SSIM index and more tradi¬ 
tional quality metrics may be illustrated geometrically in a 
vector space of image components. These image components 
can be either pixel intensities or other extracted features such as 
transformed linear coefficients. Fig. 4 shows equal-distortion 
contours drawn around three different example reference 
vectors, each of which represents the local content of one 
reference image. For the purpose of illustration, we show only 
a two-dimensional space, but in general the dimensionality 
should match the number of image components being compared.
Each contour represents a set of images with equal
distortions relative to the enclosed reference image. Fig. 4(a) 
shows the result for a simple Minkowski metric. Each contour 
has the same size and shape (a circle here, as we are assuming 
an exponent of 2). That is, perceptual distance corresponds to 
Euclidean distance. Fig. 4(b) shows a Minkowski metric in 
which different image components are weighted differently. 
This could be, for example, weighting according to the CSF, 
as is common in many models. Here the contours are ellipses, 
but still are all the same size. These are shown aligned with the 
axes, but in general could be tilted to any fixed orientation. 

Many recent models incorporate contrast masking behaviors, 
which has the effect of rescaling the equal-distortion contours 
according to the signal magnitude, as shown in Fig. 4(c). This 
may be viewed as a type of adaptive distortion metric: it depends
not just on the difference between the signals, but also
on the signals themselves. Fig. 4(d) shows a combination of
contrast masking (magnitude weighting) followed by component
weighting. Our proposed method, on the other hand, separately
computes a comparison of two independent quantities:
the vector lengths, and their angles. Thus, the contours will be 
aligned with the axes of a polar coordinate system. Figs. 4(e) 
and 4(f) show two examples of this, computed with different 
exponents. Again, this may be viewed as an adaptive distortion 
metric, but unlike previous models, both the size and the shape 
of the contours are adapted to the underlying signal. Some recent
models that use divisive normalization to describe masking
effects also exhibit signal-dependent contour orientations (e.g., 
[45], [46], [48]), although precise alignment with the axes of a 
polar coordinate system as in Fig. 4(e) and (f) is not observed in 
these methods. 

C. Image Quality Assessment Using SSIM Index 

For image quality assessment, it is useful to apply the SSIM 
index locally rather than globally. First, image statistical features
are usually highly spatially nonstationary. Second, image
distortions, which may or may not depend on the local image 
statistics, may also be space-variant. Third, at typical viewing 
distances, only a local area in the image can be perceived with 
high resolution by the human observer at one time instance 
(because of the foveation feature of the HVS [49], [50]). And 
finally, localized quality measurement can provide a spatially 
varying quality map of the image, which delivers more information
about the quality degradation of the image and may be
useful in some applications. 

Fig. 4. Three example equal-distance contours for different quality metrics. (a) Minkowski error measurement systems. (b) Component-weighted Minkowski error
measurement systems. (c) Magnitude-weighted Minkowski error measurement systems. (d) Magnitude and component-weighted Minkowski error measurement
systems. (e) The proposed system [a combination of (9) and (10)] with more emphasis on s(x, y). (f) The proposed system [a combination of (9) and (10)] with
more emphasis on c(x, y). Each image is represented as a vector, whose entries are image components. Note: this is an illustration in 2-D space. In practice, the
number of dimensions should be equal to the number of image components used for comparison (e.g., the number of pixels or transform coefficients).

In [6] and [7], the local statistics $\mu_x$, $\sigma_x$ and $\sigma_{xy}$ are computed
within a local 8 × 8 square window, which moves pixel-by-pixel
over the entire image. At each step, the local statistics and SSIM
index are calculated within the local window. One problem with
this method is that the resulting SSIM index map often exhibits
undesirable “blocking” artifacts. In this paper, we use
an 11 × 11 circular-symmetric Gaussian weighting function
$w = \{w_i \mid i = 1, 2, \ldots, N\}$, with standard deviation of 1.5 samples,
normalized to unit sum ($\sum_{i=1}^{N} w_i = 1$). The estimates of
local statistics $\mu_x$, $\sigma_x$ and $\sigma_{xy}$ are then modified accordingly as

$$\mu_x = \sum_{i=1}^{N} w_i x_i \tag{14}$$

$$\sigma_x = \left( \sum_{i=1}^{N} w_i (x_i - \mu_x)^2 \right)^{1/2} \tag{15}$$

$$\sigma_{xy} = \sum_{i=1}^{N} w_i (x_i - \mu_x)(y_i - \mu_y). \tag{16}$$
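As an illustration of the weighted statistics in (14)-(16), the following is a minimal Python sketch, not the authors' MATLAB implementation. The function names `gaussian_window` and `weighted_stats` are hypothetical, and patches are assumed to be plain nested lists of the same size as the window.

```python
import math

def gaussian_window(size=11, sigma=1.5):
    """Circular-symmetric Gaussian window of size x size, normalized to unit sum."""
    half = size // 2
    w = [[math.exp(-((r - half) ** 2 + (c - half) ** 2) / (2.0 * sigma ** 2))
          for c in range(size)] for r in range(size)]
    total = sum(sum(row) for row in w)
    return [[v / total for v in row] for row in w]

def weighted_stats(x, y, w):
    """Weighted means, standard deviations and covariance of two patches."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    ws = [v for row in w for v in row]
    mu_x = sum(wi * xi for wi, xi in zip(ws, xs))                  # Eq. (14)
    mu_y = sum(wi * yi for wi, yi in zip(ws, ys))
    sigma_x = math.sqrt(sum(wi * (xi - mu_x) ** 2
                            for wi, xi in zip(ws, xs)))            # Eq. (15)
    sigma_y = math.sqrt(sum(wi * (yi - mu_y) ** 2
                            for wi, yi in zip(ws, ys)))
    sigma_xy = sum(wi * (xi - mu_x) * (yi - mu_y)
                   for wi, xi, yi in zip(ws, xs, ys))              # Eq. (16)
    return mu_x, mu_y, sigma_x, sigma_y, sigma_xy
```

Because the weights fall off smoothly with distance from the window center, sliding this window pixel-by-pixel avoids the hard boundaries of the 8 × 8 square window.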

With such a windowing approach, the quality maps exhibit a locally
isotropic property. Throughout this paper, the SSIM measure
uses the following parameter settings: $K_1 = 0.01$; $K_2 = 0.03$.
These values are somewhat arbitrary, but we find that in
our current experiments, the performance of the SSIM index
algorithm is fairly insensitive to variations of these values.
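To make the role of $K_1$ and $K_2$ concrete, here is a small Python sketch of the SSIM value for one window, given its local statistics. It assumes the commonly used simplified form of the index with $C_1 = (K_1 L)^2$ and $C_2 = (K_2 L)^2$, where $L$ is the dynamic range (255 for 8-bit images); the defining equations lie outside this section, so this form and the function name `ssim_from_stats` should be read as assumptions.

```python
def ssim_from_stats(mu_x, mu_y, sigma_x, sigma_y, sigma_xy,
                    L=255.0, K1=0.01, K2=0.03):
    """SSIM of one local window from its (weighted) statistics."""
    C1 = (K1 * L) ** 2  # stabilizes the luminance term
    C2 = (K2 * L) ** 2  # stabilizes the contrast/structure term
    numerator = (2.0 * mu_x * mu_y + C1) * (2.0 * sigma_xy + C2)
    denominator = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x ** 2 + sigma_y ** 2 + C2)
    return numerator / denominator
```

For a window with moderate contrast, doubling both constants changes the result only slightly, which is in line with the reported insensitivity.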

In practice, one usually requires a single overall quality mea¬ 
sure of the entire image. We use a mean SSIM (MSSIM) index 
to evaluate the overall image quality 

$$\mathrm{MSSIM}(X, Y) = \frac{1}{M} \sum_{j=1}^{M} \mathrm{SSIM}(x_j, y_j) \tag{17}$$


where X and Y are the reference and the distorted images,
respectively; $x_j$ and $y_j$ are the image contents at the jth local
window; and M is the number of local windows of the image.
Depending on the application, it is also possible to compute a 
weighted average of the different samples in the SSIM index 
map. For example, region-of-interest image processing systems 
may give different weights to different segmented regions in 
the image. As another example, it has been observed that dif¬ 
ferent image textures attract human fixations with varying de¬ 
grees (e.g., [51], [52]). A smoothly varying foveated weighting 
model (e.g., [50]) can be employed to define the weights. In this 
paper, however, we use uniform weighting. A MATLAB
implementation of the SSIM index algorithm is available online at
[53].
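The pooling in (17), and the weighted variant mentioned above, can be sketched as follows. This is an illustrative Python fragment; the function name `mssim` and the nested-list representation of the SSIM map and weight map are assumptions.

```python
def mssim(ssim_map, weights=None):
    """Pool a 2-D SSIM index map into a single score.

    With weights=None this is the uniform mean of Eq. (17); otherwise it is
    a weighted average, e.g. for region-of-interest or foveated weighting.
    """
    values = [v for row in ssim_map for v in row]
    if weights is None:
        return sum(values) / len(values)
    ws = [w for row in weights for w in row]
    total = sum(ws)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w * v for w, v in zip(ws, values)) / total
```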

IV. Experimental Results 

Many image quality assessment algorithms have been shown 
to behave consistently when applied to distorted images created 
from the same original image, using the same type of distortions 
(e.g., JPEG compression). However, the effectiveness of these 
models degrades significantly when applied to a set of images 
originating from different reference images, and/or including a 
variety of different types of distortions. Thus, cross-image and 
cross-distortion tests are critical in evaluating the effectiveness 
of an image quality metric. It is impossible to show a thorough 
set of such examples, but the images in Fig. 2 provide an encour¬ 
aging starting point for testing the cross-distortion capability of 
the quality assessment algorithms. The MSE and MSSIM mea¬ 
surement results are given in the figure caption. Obviously, MSE 
performs very poorly in this case. The MSSIM values exhibit
much better consistency with the qualitative visual appearance. 

A. Best-Case/Worst-Case Validation 

We also have developed a more efficient methodology for 
examining the relationship between our objective measure 
and perceived quality. Starting from a distorted image, we 
ascend/descend the gradient of MSSIM while constraining 
the MSE to remain equal to that of the initial distorted image. 
Specifically, we iterate the following two linear-algebraic steps: 

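$$(1)\quad Y \rightarrow Y \pm \lambda P(X, Y) \nabla_Y \mathrm{MSSIM}(X, Y)$$

$$(2)\quad Y \rightarrow X + \sigma \hat{E}(X, Y)$$

where $\sigma$ is the square root of the constrained MSE, $\lambda$ controls
the step size, and $\hat{E}(X, Y)$ is a unit vector defined by

$$\hat{E}(X, Y) = \frac{Y - X}{\|Y - X\|}$$

and $P(X, Y)$ is a projection operator

$$P(X, Y) = I - \hat{E}(X, Y) \hat{E}^T(X, Y)$$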

with I the identity operator. MSSIM is differentiable and this 
procedure converges to a local maximum/minimum of the ob¬ 
jective measure. Visual inspection of these best- and worst-case 
images, along with the initial distorted image, provides a vi¬ 
sual indication of the types of distortion deemed least/most im¬ 
portant by the objective measure. Therefore, it is an expedient 
and direct method for revealing perceptual implications of the 
quality measure. An example is shown in Fig. 5, where the ini¬ 
tial image is contaminated with Gaussian white noise. It can be 
seen that the local structures of the original image are very well 
preserved in the maximal MSSIM image. On the other hand, 
the image structures are changed dramatically in the worst-case 
MSSIM image, in some cases reversing contrast. 
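The two-step iteration described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: images are flattened to lists, the gradient is supplied by the caller, and `toy_objective_gradient` is a hypothetical stand-in for the gradient of MSSIM (which is not derived here). Holding the distance between Y and X fixed is equivalent to holding the MSE fixed for a fixed number of pixels.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def constrained_step(x, y, grad, step, ascend=True):
    """One iteration: projected gradient step, then restore the distance to x."""
    diff = [yi - xi for xi, yi in zip(x, y)]
    radius = norm(diff)                      # fixed distance, i.e. fixed MSE
    e_hat = [d / radius for d in diff]       # unit vector from x toward y
    g_along = sum(g * e for g, e in zip(grad, e_hat))
    # Step (1): remove the gradient component along e_hat, i.e. (I - e e^T) grad.
    projected = [g - g_along * e for g, e in zip(grad, e_hat)]
    sign = 1.0 if ascend else -1.0
    moved = [yi + sign * step * p for yi, p in zip(y, projected)]
    # Step (2): rescale so that the distance to x equals the original radius.
    new_diff = [mi - xi for xi, mi in zip(x, moved)]
    scale = radius / norm(new_diff)
    return [xi + scale * d for xi, d in zip(x, new_diff)]

def toy_objective_gradient(y, target):
    """Gradient of the toy objective -||y - target||^2 (stand-in for MSSIM)."""
    return [-2.0 * (yi - ti) for yi, ti in zip(y, target)]
```

Passing `ascend=False` gives the descent variant used to search for the worst-case image.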


[Fig. 5 image panels: original image; gradient ascent; gradient descent.]

Fig. 5. Best- and worst-case SSIM images, with identical MSE. These are
computed by gradient ascent/descent iterative search on the MSSIM measure,
under the constraint of fixed MSE = 2500. (a) Original image (100 × 100,
8 bits/pixel, cropped from the “Boat” image). (b) Initial image, contaminated
with Gaussian white noise (MSSIM = 0.3021). (c) Maximum MSSIM image
(MSSIM = 0.9337). (d) Minimum MSSIM image (MSSIM = -0.5411).


B. Test on JPEG and JPEG2000 Image Database 

We compare the cross-distortion and cross-image perfor¬ 
mances of different quality assessment models on an image 
database composed of JPEG and JPEG2000 compressed 
images. Twenty-nine high-resolution 24 bits/pixel RGB color 
images (typically 768 × 512 or similar size) were compressed
at a range of quality levels using either JPEG or JPEG2000, 
producing a total of 175 JPEG images and 169 JPEG2000 
images. The bit rates were in the range of 0.150 to 3.336 
and 0.028 to 3.150 bits/pixel, respectively, and were chosen 
nonuniformly such that the resulting distribution of subjective 
quality scores was approximately uniform over the entire range. 
Subjects viewed the images from comfortable seating distances 
(this distance was only moderately controlled, to allow the 
data to reflect natural viewing conditions), and were asked to 
provide their perception of quality on a continuous linear scale 
that was divided into five equal regions marked with adjectives 
“Bad,” “Poor,” “Fair,” “Good,” and “Excellent.” Each JPEG 
and JPEG2000 compressed image was viewed by 13 to 20
subjects and 25 subjects, respectively. The subjects were mostly 
male college students. 

Raw scores for each subject were normalized by the mean 
and variance of scores for that subject (i.e., raw values were
converted to Z-scores [54]) and then the entire data set was
rescaled to fill the range from 1 to 100. Mean opinion scores 
(MOSs) were then computed for each image, after removing 
outliers (most subjects had no outliers). The average standard 
deviations (for each image) of the subjective scores for JPEG, 
JPEG2000, and all images were 6.00, 7.33, and 6.65, respec¬ 
tively. The image database, together with the subjective score 
and standard deviation for each image, has been made available 
on the Internet at [55]. 
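The score normalization described above can be sketched as follows. This Python fragment is illustrative only: the data layout, the function name `mean_opinion_scores`, the use of the population standard deviation, and the linear min-max rescaling to the range 1 to 100 are assumptions, and outlier removal is omitted because its criterion is not given here.

```python
import statistics

def mean_opinion_scores(raw_scores):
    """raw_scores: {subject: {image: raw score}} -> {image: MOS in [1, 100]}."""
    z = {}
    for subject, ratings in raw_scores.items():
        values = list(ratings.values())
        mean = statistics.mean(values)
        sd = statistics.pstdev(values)  # assumes the subject's scores are not all equal
        z[subject] = {img: (v - mean) / sd for img, v in ratings.items()}
    # Rescale the whole data set (all subjects together) to fill 1..100.
    all_z = [v for ratings in z.values() for v in ratings.values()]
    lo, hi = min(all_z), max(all_z)
    per_image = {}
    for ratings in z.values():
        for img, v in ratings.items():
            per_image.setdefault(img, []).append(1.0 + 99.0 * (v - lo) / (hi - lo))
    return {img: statistics.mean(vals) for img, vals in per_image.items()}
```

Per-subject Z-scoring removes differences in how each subject uses the rating scale, so two subjects who rank the images identically but use different parts of the scale contribute equally to the MOS.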

The luminance component of each JPEG and JPEG2000 
compressed image is averaged over a local 2 × 2 window and
downsampled by a factor of 2 before the MSSIM value is 
calculated. Our experiments with the current dataset show that 
the use of the other color components does not significantly 
change the performance of the model, though this should not 
be considered generally true for color image quality assess¬ 
ment. Unlike many other perceptual image quality assessment 
approaches, no specific training procedure is employed before 
applying the proposed algorithm to the database, because the 
proposed method is intended for general-purpose image quality 
assessment (as opposed to image compression alone). 
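The preprocessing step (2 × 2 averaging followed by downsampling by a factor of 2) can be illustrated with the short Python sketch below. The function name `downsample_2x2` is hypothetical, the alignment of the 2 × 2 blocks to the top-left corner is an assumption, and any trailing odd row or column is simply dropped. The input is taken to be the luminance component as a nested list.

```python
def downsample_2x2(luma):
    """Average non-overlapping 2 x 2 blocks, halving both image dimensions."""
    rows = len(luma) // 2
    cols = len(luma[0]) // 2
    return [[(luma[2 * r][2 * c] + luma[2 * r][2 * c + 1]
              + luma[2 * r + 1][2 * c] + luma[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(cols)]
            for r in range(rows)]
```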

Fig. 6. Sample JPEG images compressed to different quality levels (original size: 768 × 512; cropped to 256 × 192 for visibility). The original (a) “Buildings,”
(b) “Ocean,” and (c) “Monarch” images. (d) Compressed to 0.2673 bits/pixel, PSNR = 21.98 dB, MSSIM = 0.7118. (e) Compressed to 0.2980 bits/pixel,
PSNR = 30.87 dB, MSSIM = 0.8886. (f) Compressed to 0.7755 bits/pixel, PSNR = 36.78 dB, MSSIM = 0.9898. (g), (h) and (i) show SSIM maps of
the compressed images, where brightness indicates the magnitude of the local SSIM index (squared for visibility). (j), (k) and (l) show absolute error maps of the
compressed images (contrast-inverted for easier comparison to the SSIM maps).

Figs. 6 and 7 show some example images from the database
at different quality levels, together with their SSIM index maps
and absolute error maps. Note that at low bit rate, the coarse
quantization in JPEG and JPEG2000 algorithms often results 
in smooth representations of fine-detail regions in the image 
[e.g., the tiles in Fig. 6(d) and the trees in Fig. 7(d)]. Compared 
with other types of regions, these regions may not be worse 
in terms of pointwise difference measures such as the absolute 
error. However, since the structural information of the image
details is nearly completely lost, they exhibit poorer visual
quality. Comparing Fig. 6(g) with Fig. 6(j), and Fig. 7(g) with
Fig. 7(j), we observe that the SSIM index is better at capturing such
poor-quality regions. Also notice that for images with intensive
strong edge structures such as Fig. 7(c), it is difficult to reduce 
the pointwise errors in the compressed image, even at relatively 
high bit rate, as exemplified by Fig. 7(l). However, the compressed
image supplies acceptable perceived quality as shown
in Fig. 7(f). In fact, although the visual quality of Fig. 7(f) is
better than Fig. 7(e), its absolute error map Fig. 7(l) appears to
be worse than Fig. 7(k), as is confirmed by their PSNR values.
The SSIM index maps Figs. 7(h) and 7(i) deliver better consis¬ 
tency with perceived quality measurement. 
Fig. 7. Sample JPEG2000 images compressed to different quality levels (original size: 768 × 512; cropped to 256 × 192 for visibility). The original
(a) “Stream,” (b) “Caps,” and (c) “Bikes” images, respectively. (d) Compressed to 0.1896 bits/pixel, PSNR = 23.46 dB, MSSIM = 0.7339. (e) Compressed
to 0.1982 bits/pixel, PSNR = 34.56 dB, MSSIM = 0.9409. (f) Compressed to 1.1454 bits/pixel, PSNR = 33.47 dB, MSSIM = 0.9747. (g), (h) and (i)
show SSIM maps of the compressed images, where brightness indicates the magnitude of the local SSIM index (squared for visibility). (j), (k) and (l) show
absolute error maps of the compressed images (contrast-inverted for easier comparison to the SSIM maps).


The quality assessment models used for comparison include
PSNR, the well-known Sarnoff model,² UQI [7] and MSSIM.
The scatter plot of MOS versus model prediction for each
model is shown in Fig. 8. If PSNR is considered as a benchmark
method to evaluate the effectiveness of the other image quality
metrics, the Sarnoff model performs quite well in this test. This
is in contrast with previous published test results (e.g., [57],
[58]), where the performance of most models (including the
Sarnoff model) was reported to be statistically equivalent to
root mean squared error [57] and PSNR [58]. The UQI method
performs much better than MSE for the simple cross-distortion
test in [7], [8], but does not deliver satisfactory results in
Fig. 8. We think the major reason is that at nearly flat regions,
the denominator of the contrast comparison formula is close
to zero, which makes the algorithm unstable. By inserting
the small constants $C_1$ and $C_2$, MSSIM completely avoids
this problem and the scatter plot demonstrates that it supplies
remarkably good prediction of the subjective scores.

² Available at http://www.sarnoff.com/products_services/video_vision/jndmetrix/.
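The flat-region instability can be seen in a small numerical example. The Python sketch below evaluates only the combined contrast and structure term, assuming the simplified form $(2\sigma_{xy} + C_2)/(\sigma_x^2 + \sigma_y^2 + C_2)$ with $C_2 = (K_2 L)^2$; that form and the function name `contrast_structure` are assumptions, since the defining equations lie outside this section. Setting `C2=0.0` corresponds to the UQI case.

```python
def contrast_structure(sigma_x, sigma_y, sigma_xy, C2=0.0):
    """Contrast/structure term; C2 = 0 reproduces the unstabilized (UQI) case."""
    return (2.0 * sigma_xy + C2) / (sigma_x ** 2 + sigma_y ** 2 + C2)

# Two nearly flat patches whose tiny residual noise happens to be anti-correlated.
tiny = dict(sigma_x=0.01, sigma_y=0.01, sigma_xy=-0.0001)
unstabilized = contrast_structure(**tiny)                       # driven by noise alone
stabilized = contrast_structure(**tiny, C2=(0.03 * 255) ** 2)   # close to 1
```

Without the constant, imperceptible noise in a flat region swings the term to its extreme negative value (and an exactly flat pair gives a division by zero); with the constant, the same patches are scored as nearly identical.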
Fig. 8. Scatter plots of subjective mean opinion score (MOS) versus model prediction. Each sample point represents one test image. (a) PSNR. (b) Sarnoff model
(using Sarnoff JNDmetrix 8.0 [55]). (c) UQI [7] (equivalent to MSSIM with square window and $K_1 = K_2 = 0$). (d) MSSIM (Gaussian window, $K_1 = 0.01$;
$K_2 = 0.03$).


In order to provide quantitative measures on the performance 
of the objective quality assessment models, we follow the per¬ 
formance evaluation procedures employed in the video quality 
experts group (VQEG) Phase I FR-TV test [58], where four 
evaluation metrics were used. First, logistic functions are used 
in a fitting procedure to provide a nonlinear mapping between 
the objective/subjective scores. The fitted curves are shown in 
Fig. 8. In [58], Metric 1 is the correlation coefficient between 
objective/subjective scores after variance-weighted regression 
analysis. Metric 2 is the correlation coefficient between
objective/subjective scores after nonlinear regression analysis.
Together, these two metrics provide an evaluation of prediction
accuracy. Metric 3 is the Spearman rank-order correlation
coefficient between the objective/subjective scores. It is
considered a measure of prediction monotonicity. Finally, Metric 4 is
the outlier ratio (percentage of the number of predictions outside
the range of ±2 standard deviations) of the predictions
after the nonlinear mapping, which is a measure of prediction
consistency. More details on these metrics can be found in
[58]. In addition to these, we also calculated the mean absolute
prediction error (MAE) and root mean square prediction error
(RMS) after nonlinear regression, and the weighted mean absolute
prediction error (WMAE) and weighted root mean square
prediction error (WRMS) after variance-weighted regression. The
evaluation results for all the models being compared are given in
Table I. For every one of these criteria, MSSIM performs better
than all of the other models being compared.
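Several of the criteria above are simple to compute once the objective scores have been mapped onto the subjective scale. The Python sketch below is illustrative only: the function names are hypothetical, the logistic fitting and variance-weighted regression of [58] are not reproduced, and the outlier ratio is returned as a fraction using the per-image standard deviation of the subjective scores, which is an assumption about the exact VQEG definition.

```python
import math

def pearson_cc(a, b):
    """Linear correlation coefficient between two score lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def ranks(values):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return result

def spearman_srocc(a, b):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson_cc(ranks(a), ranks(b))

def mae(pred, mos):
    return sum(abs(p - m) for p, m in zip(pred, mos)) / len(mos)

def rms(pred, mos):
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(pred, mos)) / len(mos))

def outlier_ratio(pred, mos, mos_std):
    """Fraction of predictions more than 2 standard deviations from the MOS."""
    outliers = sum(1 for p, m, s in zip(pred, mos, mos_std) if abs(p - m) > 2.0 * s)
    return outliers / len(mos)
```

The rank-order coefficient depends only on the ordering of the scores, so it is unaffected by any monotonic nonlinear mapping, which is why it serves as the monotonicity measure.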

V. Discussion 

In this paper, we have summarized the traditional approach 
to image quality assessment based on error-sensitivity, and have 
enumerated its limitations. We have proposed the use of struc¬ 
tural similarity as an alternative motivating principle for the de¬ 
sign of image quality measures. To demonstrate our structural 
similarity concept, we developed an SSIM index and showed 
that it compares favorably with other methods in accounting 
for our experimental measurements of subjective quality of 344 
JPEG and JPEG2000 compressed images. 

Although the proposed SSIM index method is motivated 
from substantially different design principles, we see it as 
complementary to the traditional approach. Careful analysis 
shows that both the SSIM index and several recently developed 
divisive-normalization based masking models exhibit input-de¬ 
pendent behavior in measuring signal distortions [45], [46], 
[48]. It seems possible that the two approaches may eventually 
converge to similar solutions. 
TABLE I

Performance Comparison of Image Quality Assessment Models. CC: Correlation Coefficient; MAE: Mean Absolute Error; RMS: Root Mean
Squared Error; OR: Outlier Ratio; WMAE: Weighted Mean Absolute Error; WRMS: Weighted Root Mean Squared
Error; SROCC: Spearman Rank-Order Correlation Coefficient

          |   Non-linear Regression     | Variance-weighted Regression | Rank-order
Model     |  CC     MAE    RMS    OR    |  CC     WMAE   WRMS   OR     | SROCC
----------+-----------------------------+------------------------------+-----------
PSNR      |  0.905  6.53   8.45   0.157 |  0.903  6.18   8.26   0.140  | 0.901
Sarnoff   |  0.956  4.66   5.81   0.064 |  0.956  4.42   5.62   0.061  | 0.947
UQI       |  0.866  7.76   9.90   0.189 |  0.861  7.64   9.79   0.195  | 0.863
MSSIM     |  0.967  3.95   5.06   0.041 |  0.967  3.79   4.87   0.041  | 0.963

There are a number of issues that are worth investigation with 
regard to the specific SSIM index of (12). First, the optimization 
of the SSIM index for various image processing algorithms 
needs to be studied. For example, it may be employed for 
rate-distortion optimizations in the design of image compression 
algorithms. This is not an easy task since (12) is mathematically 
more cumbersome than MSE. Second, the application scope 
of the SSIM index may not be restricted to image processing. 
In fact, because it is a symmetric measure, it can be thought 
of as a similarity measure for comparing any two signals. The 
signals can be either discrete or continuous, and can live in 
a space of arbitrary dimensionality. 

We consider the proposed SSIM indexing approach as a par¬ 
ticular implementation of the philosophy of structural similarity, 
from an image formation point of view. Under the same phi¬ 
losophy, other approaches may emerge that could be signifi¬ 
cantly different from the proposed SSIM indexing algorithm. 
Creative investigation of the concepts of structural information
and structural distortion is likely to drive the success of these
innovations.